Learning to Refine Human Pose Estimation
Multi-person pose estimation in images and videos is an important yet
challenging task with many applications. Despite the large improvements in
human pose estimation enabled by the development of convolutional neural
networks, there remain many difficult cases where even the
state-of-the-art models fail to correctly localize all body joints. This
motivates the need for an additional refinement step that addresses these
challenging cases and can be easily applied on top of any existing method. In
this work, we introduce a pose refinement network (PoseRefiner) which takes as
input both the image and a given pose estimate and learns to directly predict a
refined pose by jointly reasoning about the input-output space. In order for
the network to learn to refine incorrect body joint predictions, we employ a
novel data augmentation scheme for training, where we model "hard" human pose
cases. We evaluate our approach on four popular large-scale pose estimation
benchmarks, namely MPII Single- and Multi-Person Pose Estimation, PoseTrack
Pose Estimation, and PoseTrack Pose Tracking, and report systematic improvements
over the state of the art.
Comment: To appear in CVPRW (2018). Workshop: Visual Understanding of Humans in Crowd Scene and the 2nd Look Into Person Challenge (VUHCS-LIP).
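To make the refinement idea concrete, here is a minimal sketch (in PyTorch) of a network that, in the spirit of PoseRefiner, jointly processes the image and a heatmap rendering of the given pose estimate. The layer configuration and the name `PoseRefinerSketch` are illustrative assumptions, not the architecture from the paper.

```python
# Minimal sketch of the refinement idea: the network sees the image together
# with heatmaps of the (possibly wrong) input pose and predicts refined
# heatmaps. During training, ground-truth poses would be corrupted to
# simulate the "hard" cases the paper's augmentation scheme models.
import torch
import torch.nn as nn

class PoseRefinerSketch(nn.Module):
    def __init__(self, num_joints=16):  # e.g., 16 joints as in MPII
        super().__init__()
        # Input: 3 RGB channels + one heatmap channel per body joint.
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_joints, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_joints, 1),  # refined heatmaps, one per joint
        )

    def forward(self, image, pose_heatmaps):
        # Joint reasoning over the input-output space: image and pose
        # estimate are processed together in one forward pass.
        x = torch.cat([image, pose_heatmaps], dim=1)
        return self.net(x)
```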
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach allows us to reach competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task.Comment: Accepted in International Journal of Computer Vision (IJCV
Learning to segment in images and videos with different forms of supervision
Much progress has been made in image and video segmentation over recent years. To a large extent, this success can be attributed to strong appearance models learned entirely from data, in particular using deep learning methods. However, to perform best, these methods require large representative training datasets with expensive pixel-level annotations, which in the case of videos are prohibitively expensive to obtain. There is therefore a need to relax this constraint and to consider alternative forms of supervision that are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision.

First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate approximate pixel-level ground truth from these weaker forms of annotation to train a network, which makes it possible to achieve high-quality results comparable to full supervision without any modifications to the network architecture or the training procedure.

Second, we address the excessive computational and memory costs inherent in solving video segmentation via graphs. We propose approaches that improve runtime and memory efficiency as well as output segmentation quality by learning the best representation of the graph from the available training data. In particular, we contribute methods for learning must-link constraints, the topology and edge weights of the graph, as well as for enhancing the graph nodes (superpixels) themselves.

Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data available for training convolutional networks. We introduce an architecture that allows training with static images only and propose an elaborate data synthesis scheme which creates a large number of training examples close to the target domain from the given first-frame mask. With the proposed techniques we show that densely annotated consecutive video frames are not necessary to achieve high-quality, temporally coherent video segmentation results.

In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking, and contributes new ways of training convolutional networks with a limited amount of pixel-level annotated training data.
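As an illustration of the first contribution (approximate ground truth from weak annotations), here is a minimal sketch that converts a bounding-box annotation into a pseudo pixel-level mask using OpenCV's GrabCut. The thesis uses more refined box-to-mask procedures, so this is a simplified stand-in.

```python
# Turn a box annotation into an approximate pixel-level training label.
import numpy as np
import cv2

def box_to_pseudo_mask(image, box, iters=5):
    """box = (x, y, w, h); returns a binary pseudo ground-truth mask."""
    mask = np.zeros(image.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)  # GrabCut background model
    fgd = np.zeros((1, 65), np.float64)  # GrabCut foreground model
    cv2.grabCut(image, mask, box, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)
    # Pixels marked as (probable) foreground become the approximate label.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)
```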
Exploiting saliency for object segmentation from image level labels
There have been remarkable improvements in the semantic labelling task in
recent years. However, state-of-the-art methods rely on large-scale
pixel-level annotations. This paper studies the problem of training a
pixel-wise semantic labeller network from image-level annotations of the
present object classes. Recently, it has been shown that high-quality seeds
indicating discriminative object regions can be obtained from image-level
labels. Without additional information, obtaining the full extent of the object
is an inherently ill-posed problem due to co-occurrences. We propose using a
saliency model as additional information and thereby exploit prior knowledge on
the object extent and image statistics. We show how to combine both information
sources in order to recover 80% of the fully supervised performance, which is
the new state of the art in weakly supervised training for pixel-wise semantic
labelling. The code is available at https://goo.gl/KygSeb.
Comment: CVPR 2017.
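A minimal sketch of the fusion idea under simplifying assumptions: class seeds from image-level labels say *what* is present and roughly where, while a class-agnostic saliency map supplies the object extent. The single-class rule and the threshold below are illustrative choices; the paper's combination scheme is more involved.

```python
# Fuse class-discriminative seeds with a class-agnostic saliency map into a
# pseudo label map for training a pixel-wise semantic labeller.
import numpy as np

def fuse_seeds_and_saliency(seeds, saliency, sal_thresh=0.5):
    """seeds: (H, W) int map, 0 = unlabeled, k > 0 = class-k seed pixels.
    saliency: (H, W) float map in [0, 1]. Returns an (H, W) pseudo label."""
    labels = np.zeros_like(seeds)
    salient = saliency > sal_thresh
    present = np.unique(seeds[seeds > 0])
    if len(present) == 1:
        # Single present class: let saliency define the object extent.
        labels[salient] = present[0]
    labels[seeds > 0] = seeds[seeds > 0]  # always trust the seeds themselves
    return labels
```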
One-Shot Synthesis of Images and Segmentation Masks
Joint synthesis of images and segmentation masks with generative adversarial
networks (GANs) is promising to reduce the effort needed for collecting image
data with pixel-wise annotations. However, to learn high-fidelity image-mask
synthesis, existing GAN approaches first need a pre-training phase requiring
large amounts of image data, which limits their utilization in restricted image
domains. In this work, we take a step to reduce this limitation, introducing
the task of one-shot image-mask synthesis. We aim to generate diverse images
and their segmentation masks given only a single labelled example, and
assuming, contrary to previous models, no access to any pre-training data. To
this end, inspired by the recent architectural developments of single-image
GANs, we introduce our OSMIS model which enables the synthesis of segmentation
masks that are precisely aligned to the generated images in the one-shot
regime. Besides achieving high fidelity of the generated masks, OSMIS
outperforms state-of-the-art single-image GAN models in image synthesis quality
and diversity. In addition, despite not using any additional data, OSMIS
demonstrates an impressive ability to serve as a source of useful data
augmentation for one-shot segmentation applications, providing performance
gains that are complementary to standard data augmentation techniques. Code is
available at https://github.com/boschresearch/one-shot-synthesis
Comment: Accepted as a conference paper at the IEEE Winter Conference on Applications of Computer Vision (WACV) 2022.
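A minimal sketch of why the generated masks can be precisely aligned with the images: if one generator trunk feeds both an image head and a mask head, the two outputs share the same spatial features by construction. All layer sizes and names here are placeholders, not the OSMIS architecture, and a real model would generate at much higher resolution.

```python
# One generator, two pixel-aligned outputs: an RGB image and its mask.
import torch
import torch.nn as nn

class ImageMaskGeneratorSketch(nn.Module):
    def __init__(self, z_dim=64, num_classes=2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, 4), nn.ReLU(),        # 1x1 -> 4x4
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),     # 4x4 -> 8x8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),      # 8x8 -> 16x16
        )
        self.to_image = nn.Conv2d(32, 3, 3, padding=1)
        self.to_mask = nn.Conv2d(32, num_classes, 3, padding=1)

    def forward(self, z):
        h = self.trunk(z)                     # shared spatial features
        image = torch.tanh(self.to_image(h))  # RGB in [-1, 1]
        # argmax for inspection; training would keep soft per-class maps.
        mask = self.to_mask(h).argmax(1)
        return image, mask
```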
Generating Novel Scene Compositions from Single Images and Videos
Given a large dataset for training, GANs can achieve remarkable performance
for the image synthesis task. However, training GANs in extremely low data
regimes remains a challenge, as overfitting often occurs, leading to
memorization or training divergence. In this work, we introduce SIV-GAN, an
unconditional generative model that can generate new scene compositions from a
single training image or a single video clip. We propose a two-branch
discriminator architecture, with content and layout branches designed to judge
internal content and scene layout realism separately from each other. This
discriminator design enables synthesis of visually plausible, novel
compositions of a scene, with varying content and layout, while preserving the
context of the original sample. Compared to previous single-image GANs, our
model generates more diverse, higher-quality images, while not being restricted
to the single-image setting. We show that SIV-GAN successfully deals with the
new and challenging task of learning from a single video, for which prior GAN
models fail to achieve synthesis of both high quality and diversity.
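A minimal sketch of the two-branch discriminator idea, with assumed layer sizes: a shared feature extractor feeds a content branch that pools away spatial information (judging *what* is in the image) and a layout branch that keeps the spatial grid (judging *where* things are), so the two aspects of realism are scored separately.

```python
# Two-branch discriminator: separate content and layout realism scores.
import torch
import torch.nn as nn

class TwoBranchDiscriminatorSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
        )
        # Content branch: global pooling discards layout, judges content.
        self.content_head = nn.Linear(128, 1)
        # Layout branch: keeps the spatial grid, judges realism per location.
        self.layout_head = nn.Conv2d(128, 1, 3, padding=1)

    def forward(self, x):
        h = self.features(x)
        content_score = self.content_head(h.mean(dim=(2, 3)))  # (B, 1)
        layout_score = self.layout_head(h)                     # (B, 1, H', W')
        return content_score, layout_score
```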
Divide & Bind Your Attention for Improved Generative Semantic Nursing
Emerging large-scale text-to-image generative models, e.g., Stable Diffusion
(SD), have exhibited impressive results with high fidelity. Despite this
remarkable progress, current state-of-the-art models still struggle to
generate images fully adhering to the input prompt. Prior work, Attend &
Excite, has introduced the concept of Generative Semantic Nursing (GSN), aiming
to optimize cross-attention during inference time to better incorporate the
semantics. It demonstrates promising results for simple prompts,
e.g., "a cat and a dog". However, its efficacy declines when dealing with
more complex prompts, and it does not explicitly address the problem of
improper attribute binding. To address the challenges posed by complex prompts
or scenarios involving multiple entities and to achieve improved attribute
binding, we propose Divide & Bind. We introduce two novel loss objectives for
GSN: an attendance loss and a binding loss. Our approach stands out in its
ability to faithfully synthesize desired objects with improved attribute
alignment from complex prompts and exhibits superior performance across
multiple evaluation benchmarks. More videos and updates can be found on the
project page: https://sites.google.com/view/divide-and-bind.
Comment: Project page: https://sites.google.com/view/divide-and-bind
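To make the Generative Semantic Nursing mechanism concrete, here is a minimal sketch of an inference-time update: the diffusion latent is nudged by the gradient of a loss computed on cross-attention maps. The simple per-token attention objective below follows the Attend & Excite style; Divide & Bind's attendance and binding losses replace this objective but reuse the same update pattern.

```python
# One GSN step: differentiate a cross-attention loss w.r.t. the latent.
import torch

def gsn_update(latent, attn_maps, token_ids, step_size=0.1):
    """latent: (1, C, H, W) diffusion latent with requires_grad=True.
    attn_maps: (H'*W', num_tokens) cross-attention, differentiable w.r.t.
    the latent. token_ids: prompt tokens that must be expressed."""
    # Loss: even the least-attended target token should peak somewhere.
    per_token_max = attn_maps[:, token_ids].max(dim=0).values
    loss = 1.0 - per_token_max.min()
    grad = torch.autograd.grad(loss, latent)[0]
    return latent - step_size * grad  # nudge latent toward better semantics
```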